On the maximal sum of exponents of runs in a string
A run is an inclusion-maximal occurrence in a string (as a subinterval) of a
repetition $v$ with a period $p$ such that $2p \le |v|$. The exponent of a run
is defined as $|v|/p$ and is $\ge 2$. We show new bounds on the maximal sum of
exponents of runs in a string of length $n$. Our upper bound of $4.1n$ is
better than the best previously known proven bound of $5.6n$ by Crochemore &
Ilie (2008). The lower bound of $2.035n$, obtained using a family of binary
words, contradicts the conjecture of Kolpakov & Kucherov (1999) that the
maximal sum of exponents of runs in a string of length $n$ is smaller than $2n$.
Comment: 7 pages, 1 figur
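The definitions in this abstract can be checked on small inputs with a brute-force sketch. It assumes the standard reading of a run: a fragment with smallest period p and length at least 2p that cannot be extended in either direction while keeping period p. The function names are illustrative and the code is quadratic or worse, so it is not the method of the paper.

```python
from fractions import Fraction


def smallest_period(s):
    """Smallest p >= 1 such that s[i] == s[i + p] for every valid i."""
    n = len(s)
    for p in range(1, n + 1):
        if all(s[i] == s[i + p] for i in range(n - p)):
            return p
    return n


def runs(w):
    """Return all runs of w as sorted (start, end_exclusive, period) triples."""
    n = len(w)
    found = set()
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if w[i] != w[i + p]:
                i += 1
                continue
            # Extend the block of positions k with w[k] == w[k + p].
            j = i
            while j + p < n and w[j] == w[j + p]:
                j += 1
            start, end = i, j + p  # w[start:end] is maximal with period p
            if end - start >= 2 * p and smallest_period(w[start:end]) == p:
                found.add((start, end, p))
            i = j + 1
    return sorted(found)


def sum_of_exponents(w):
    """Sum of length / period over all runs, kept exact with Fraction."""
    return sum(Fraction(end - start, p) for start, end, p in runs(w))
```

For instance, `runs("aabaabaa")` reports three runs of period 1 and one run of period 3 spanning the whole string.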
Near-Optimal Computation of Runs over General Alphabet via Non-Crossing LCE Queries
Longest common extension queries (LCE queries) and runs are ubiquitous in
algorithmic stringology. Linear-time algorithms computing runs and
preprocessing for constant-time LCE queries have been known for over a decade.
However, these algorithms assume a linearly-sortable integer alphabet. A recent
breakthrough paper by Bannai et al. (SODA 2015) showed a link between the
two notions: all the runs in a string can be computed via a linear number of
LCE queries. The first to consider these problems over a general ordered
alphabet was Kosolobov (Inf. Process. Lett., 2016), who presented an
$O(n (\log n)^{2/3})$-time algorithm for answering $O(n)$ LCE queries. This
result was improved by Gawrychowski et al. (accepted to CPM 2016) to
$O(n \log \log n)$ time. In this work we note a special non-crossing property
of LCE queries asked in the runs computation. We show that any $n$ such
non-crossing queries can be answered on-line in $O(n \alpha(n))$ time, which
yields an $O(n \alpha(n))$-time algorithm for computing runs.
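To make the notion of an LCE query concrete, here is a naive sketch that answers one query by direct character comparison and uses it to extend a candidate period to the right. It only illustrates what a query returns, not the data structures discussed in the abstract. The helper `periodic_extent` is a hypothetical name introduced for this example.

```python
def lce(w, i, j):
    """Length of the longest common prefix of w[i:] and w[j:] (naive scan)."""
    k = 0
    while i + k < len(w) and j + k < len(w) and w[i + k] == w[j + k]:
        k += 1
    return k


def periodic_extent(w, i, p):
    """Length of the longest fragment starting at i that has period p."""
    return p + lce(w, i, i + p)
```

In `"abaababa"`, the query `lce(w, 0, 3)` compares `"abaababa"` with `"ababa"`, and `periodic_extent(w, 0, 3)` gives the length of the fragment `"abaaba"`.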
Searching of gapped repeats and subrepetitions in a word
A gapped repeat is a factor of the form $uvu$ where $u$ and $v$ are nonempty
words. The period of the gapped repeat is defined as $|u|+|v|$. The gapped
repeat is maximal if it cannot be extended to the left or to the right by at
least one letter while preserving its period. The gapped repeat is called
$\alpha$-gapped if its period is not greater than $\alpha|u|$. A
$\delta$-subrepetition is a factor whose exponent is less than 2 but is not
less than $1+\delta$ (the exponent of the factor is the quotient of the length
and the minimal period of the factor). The $\delta$-subrepetition is maximal if
it cannot be extended to the left or to the right by at least one letter while
preserving its minimal period. We reveal a close relation between maximal
gapped repeats and maximal subrepetitions. Moreover, we give upper bounds on
the number of maximal $\alpha$-gapped repeats and on the number of maximal
$\delta$-subrepetitions in a word of length $n$. Using the obtained upper
bounds, we propose algorithms for finding all maximal $\alpha$-gapped repeats
and all maximal $\delta$-subrepetitions in a word of length $n$. For the
algorithm finding all maximal $\alpha$-gapped repeats, the time complexity is
given separately for the case of constant alphabet size and for the general
case. For finding all maximal $\delta$-subrepetitions we propose two
algorithms. The time complexity of the first is given separately for the case
of constant alphabet size and for the general case; the second is analyzed in
terms of expected time complexity.
On the maximal number of cubic subwords in a string
We investigate the problem of the maximum number of cubic subwords (of the
form $www$) in a given word. We also consider square subwords (of the form
$ww$). The problem of the maximum number of squares in a word is not well
understood. Several new results related to this problem are produced in the
paper. We consider two simple problems related to the maximum number of
subwords which are squares or which are highly repetitive; then we provide a
nontrivial estimation for the number of cubes. We show that the maximum number
of squares $xx$ such that $x$ is not a primitive word (nonprimitive squares) in
a word of length $n$ is exactly $\lfloor n/2 \rfloor - 1$, and the
maximum number of subwords of the form $x^k$, for $k \ge 3$, is exactly $n-2$.
In particular, the maximum number of cubes in a word is not greater than $n-2$
either. Using very technical properties of occurrences of cubes, we improve
this bound significantly. We show that the maximum number of cubes in a word of
length $n$ is between $\frac{1}{2}n$ and $\frac{4}{5}n$. (In particular, we
improve the lower bound from the conference version of the paper.)
Comment: 14 page
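The counting problem can be explored on short words with a brute-force sketch. It assumes that the quantity of interest is the number of distinct subwords that are k-th powers (cubes for k = 3, squares for k = 2). The function name is illustrative.

```python
def distinct_powers(w, k=3):
    """Set of distinct subwords of w of the form u repeated k times, u nonempty."""
    n = len(w)
    found = set()
    for length in range(1, n // k + 1):          # length of the root u
        for i in range(n - k * length + 1):      # start of the candidate power
            u = w[i:i + length]
            if w[i:i + k * length] == u * k:
                found.add(u * k)
    return found
```

On `"aaaaaa"` the distinct cubes are `"aaa"` and `"aaaaaa"`. On `"abababab"` the distinct squares are `"abab"`, `"baba"` and `"abababab"`, the last of which has a nonprimitive root.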
Efficient Seeds Computation Revisited
The notion of the cover is a generalization of a period of a string, and
there are linear-time algorithms for finding the shortest cover. The seed is a
more complicated generalization of periodicity: it is a cover of a superstring
of a given string, and the shortest seed problem is of much higher algorithmic
difficulty. The problem is not well understood; no linear-time algorithm is
known. In the paper we give linear-time algorithms for some of its versions:
computing the shortest left-seed array, the longest left-seed array, and
checking for seeds of a given length. The algorithm for the last problem is
used to compute the seed array of a string (i.e., the shortest seeds for all
the prefixes of the string) in $O(n^2)$ time. We also describe a simpler
alternative algorithm that efficiently computes the shortest seeds. As a
by-product we obtain an $O(n \log (n/m))$ time algorithm checking if the
shortest seed has length at least $m$ and finding the corresponding seed. We
also correct some important details missing in the previously known
shortest-seed algorithm (Iliopoulos et al., 1996).
Comment: 14 pages, accepted to CPM 201
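As a small illustration of the cover notion only (seeds are not handled), the following brute-force sketch finds the shortest cover of a string. It assumes the usual definition: a cover is a prefix of the string whose occurrences together cover every position. This is not one of the linear-time algorithms mentioned in the abstract, and the function name is illustrative.

```python
def shortest_cover(s):
    """Shortest prefix c of s whose occurrences cover every position of s."""
    n = len(s)
    for length in range(1, n + 1):
        c = s[:length]
        covered = 0            # positions [0, covered) are covered so far
        pos = s.find(c)        # always 0, since c is a prefix
        while pos != -1 and pos <= covered:
            covered = pos + length
            pos = s.find(c, pos + 1)
        if covered == n:
            return c
    return s
```

For `"abaababa"` the occurrences of `"aba"` at positions 0, 3 and 5 cover the whole string, while `"a"` and `"ab"` leave gaps.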
Minimal Forbidden Factors of Circular Words
Minimal forbidden factors are a useful tool for investigating properties of
words and languages. Two factorial languages are distinct if and only if they
have different (antifactorial) sets of minimal forbidden factors. There exist
algorithms for computing the minimal forbidden factors of a word, as well as of
a regular factorial language. Conversely, Crochemore et al. [IPL, 1998] gave an
algorithm that, given the trie recognizing a finite antifactorial language $M$,
computes a DFA recognizing the language whose set of minimal forbidden factors
is $M$. In the same paper, they showed that the obtained DFA is minimal if the
input trie recognizes the minimal forbidden factors of a single word. We
generalize this result to the case of a circular word. We discuss several
combinatorial properties of the minimal forbidden factors of a circular word.
As a byproduct, we obtain a formal definition of the factor automaton of a
circular word. Finally, we investigate the case of minimal forbidden factors of
the circular Fibonacci words.
Comment: To appear in Theoretical Computer Scienc
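For a single linear word (the circular case studied in the paper is not covered), minimal forbidden factors can be enumerated by brute force. The sketch assumes the usual definition: a word is a minimal forbidden factor if it is not a factor of w but all of its proper factors are. The `alphabet` parameter and the function names are assumptions of this example.

```python
def factors(w):
    """All factors (substrings) of w, including the empty word."""
    n = len(w)
    return {w[i:j] for i in range(n) for j in range(i, n + 1)} | {""}


def minimal_forbidden_factors(w, alphabet):
    """Words u not occurring in w whose longest proper prefix and suffix both occur."""
    fac = factors(w)
    mff = set()
    for x in fac:                  # x plays the role of u without its last letter
        for b in alphabet:
            u = x + b
            if u not in fac and u[1:] in fac:
                mff.add(u)
    return mff
```

Because the set of factors is closed under taking factors, checking the longest proper prefix and the longest proper suffix is enough. For `"aab"` over the alphabet `"ab"` this yields `"ba"`, `"bb"` and `"aaa"`.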
Fast Label Extraction in the CDAWG
The compact directed acyclic word graph (CDAWG) of a string $T$ of length $n$
takes space proportional just to the number $e$ of right extensions of the
maximal repeats of $T$, and it is thus an appealing index for highly repetitive
datasets, like collections of genomes from similar species, in which $e$ grows
significantly more slowly than $n$. We reduce the time needed to count the
number of occurrences of a pattern of length $m$, using an existing data
structure that takes an amount of space proportional to the size of the CDAWG.
This implies a corresponding reduction in the time needed to locate all the
occurrences of the pattern. We also reduce the time needed to read the
characters of the label of an edge of the suffix tree of $T$, and the time
needed to compute the matching statistics between a query of length $m$ and
$T$, using an existing representation of the suffix tree based on the CDAWG.
All such improvements derive from extracting the label of a vertex or of an arc
of the CDAWG using a straight-line program induced by the reversed CDAWG.
Comment: 16 pages, 1 figure. In proceedings of the 24th International
Symposium on String Processing and Information Retrieval (SPIRE 2017). arXiv
admin note: text overlap with arXiv:1705.0864
On-line construction of position heaps
We propose a simple linear-time on-line algorithm for constructing a position
heap for a string [Ehrenfeucht et al., 2011]. Our definition of position heap
differs slightly from the one proposed in [Ehrenfeucht et al., 2011] in that it
considers the suffixes ordered from left to right. Our construction is based on
classic suffix pointers and resembles Ukkonen's algorithm for suffix trees
[Ukkonen, 1995]. Using suffix pointers, the position heap can be extended into
the augmented position heap that allows for a linear-time string matching
algorithm [Ehrenfeucht et al., 2011].
Comment: to appear in Journal of Discrete Algorithm
A Minimal Periods Algorithm with Applications
Kosaraju, in "Computation of squares in a string", briefly described a
linear-time algorithm for computing the minimal squares starting at each
position in a word. Using the same construction of suffix trees, we generalize
his result and describe in detail how to compute in O(k|w|) time the minimal
k-th power, with period of length larger than s, starting at each position in a
word w, for an arbitrary exponent k and integer s. We provide the
complete proof of correctness of the algorithm, which is not entirely
clear in Kosaraju's original paper. The algorithm can be used as a subroutine
to detect certain types of pseudo-patterns in words, which was our original
motivation for studying this generalization.
Comment: 14 page
Detecting One-variable Patterns
Given a pattern $p$ in which every variable occurrence is either a variable
$x$ or its reversal, and in which these occurrences are separated by strings
that contain no variables, we describe an algorithm that constructs a compact
representation of all instances of $p$ in an input string of length $n$ over a
polynomially bounded integer alphabet, from which those instances can then be
reported.
Comment: 16 pages (+13 pages of Appendix), 4 figures, accepted to SPIRE 201